Input: location(x,y), accuracy, timestamp Output: place_id
Assumption: - Generally, place_id is most closely related to ‘location+accuracy.’ - Since place_id is steady in space, and its open time is also decided. But the ‘time’ will be an auxiliary for ‘accuracy.’
sample <- fread("~/PycharmProjects/kaggle-project/facebook/sample_submission.csv")
Read 0.0% of 8607230 rows
Read 9.8% of 8607230 rows
Read 9.9% of 8607230 rows
Read 16.6% of 8607230 rows
Read 24.2% of 8607230 rows
Read 24.5% of 8607230 rows
Read 34.0% of 8607230 rows
Read 44.7% of 8607230 rows
Read 45.4% of 8607230 rows
Read 55.9% of 8607230 rows
Read 59.0% of 8607230 rows
Read 68.5% of 8607230 rows
Read 75.5% of 8607230 rows
Read 86.4% of 8607230 rows
Read 91.0% of 8607230 rows
Read 95.2% of 8607230 rows
Read 8607230 rows and 2 (of 2) columns from 0.328 GB file in 00:00:45
head(sample)
row_id place_id
1: 0 3073560757 9004412889 5652080691
2: 1 1652178628 4379515211 6612350960
3: 2 4894407065 3920195083 7608574746
4: 3 7272466660 2004687925 1455486822
5: 4 4720452725 4967325204 4782917866
6: 5 1283939222 8208358948 2070306016
library(data.table) #reading in the data
library(dplyr) #dataframe manipulation
library(ggplot2) #viz
library(ranger) #the random forest implementation
library(plotly) #3D plotting
library(tidyr) #dataframe manipulation
library(FNN) #k nearest neighbors algorithm
library(xgboost)
fb <- fread("~/PycharmProjects/kaggle-project/facebook/train.csv", integer64 = "character", showProgress = FALSE)
nrow(fb)
[1] 29118021
head(fb, 3)
row_id x y accuracy time place_id
1: 0 0.7941 9.0809 54 470702 8523065625
2: 1 5.9567 4.7968 13 186555 1757726713
3: 2 8.3078 7.0407 74 322648 1137537235
summary(fb)
row_id x y accuracy
Min. : 0 Min. : 0.000 Min. : 0.000 Min. : 1.00
1st Qu.: 7279505 1st Qu.: 2.535 1st Qu.: 2.497 1st Qu.: 27.00
Median :14559010 Median : 5.009 Median : 4.988 Median : 62.00
Mean :14559010 Mean : 5.000 Mean : 5.002 Mean : 82.85
3rd Qu.:21838515 3rd Qu.: 7.461 3rd Qu.: 7.510 3rd Qu.: 75.00
Max. :29118020 Max. :10.000 Max. :10.000 Max. :1033.00
time place_id
Min. : 1 Length:29118021
1st Qu.:203057 Class :character
Median :433922 Mode :character
Mean :417010
3rd Qu.:620491
Max. :786239
How to represent accuracy?
Assumption: - Since there are more than 100,000 places located in a 10 km by 10 km square. In this 0.25*0.25 area, there would supposed to be 625 unique place_id on average.
fb %>% filter(x >1, x <1.25, y >2.5, y < 2.75) -> fb_s
nrow(fb_s)
[1] 17710
fb %>% filter(x >3, x <3.25, y >2.5, y < 2.75) -> fb_s2
nrow(fb_s2)
[1] 15929
Since target is to classify place_id, 1st to observe place_id
Assumption: - From the figure 1&2, can see the trend of sorted place_id are almost same via different sample area. - From figure 3, some events are in the same place_id class, but there are continusely increasing place ids between 2 id like transition bridge. That maybe a wrong classification.
par(mfrow=c(3,1))
plot(sort(fb_s$place_id))
plot(sort(fb_s2$place_id))
plot(sort(fb_s2$place_id)[0:2000])
As to ass-2, from figures: seem no relation between place_id and accuracy
par(mfrow=c(2,1))
r_pla_aur=sort(fb_s2$place_id,index.return=TRUE)
plot(r_pla_aur$x[0:2000])
d=r_pla_aur$ix[0:2000]
plot(fb_s2[d,"accuracy"])
par(mfrow=c(2,1))
r_pla_aur=sort(fb_s2$place_id,index.return=TRUE)
plot(r_pla_aur$x[100:150])
d=r_pla_aur$ix[100:150]
plot(fb_s2[d,"accuracy"])
par(mfrow=c(2,1))
r_pla_aur=sort(fb_s2$place_id,index.return=TRUE)
plot(r_pla_aur$x[500:750])
d=r_pla_aur$ix[500:750]
plot(fb_s2[d,"accuracy"])
To count place_id
Assumption: - In this sample area, count by place_id, the 140 largest id seems to be correct class, but as estimation, there should be about 625 valid ids.
nrow(fb_s %>% count(place_id))
[1] 805
sort((fb_s %>% count(place_id))$n, decreasing = T)
[1] 1044 894 889 863 656 651 594 539 533 486 414 341 259 253
[15] 233 221 220 200 194 192 182 181 177 167 159 157 155 143
[29] 139 135 134 128 127 123 120 117 115 114 111 105 102 99
[43] 97 94 94 89 88 88 87 86 85 84 81 76 75 74
[57] 74 71 66 61 58 56 56 56 54 53 53 51 50 50
[71] 50 48 48 47 46 43 43 42 41 39 38 37 37 36
[85] 34 33 33 32 29 28 28 28 27 27 26 26 26 25
[99] 24 23 23 22 22 22 21 21 21 21 20 20 20 19
[113] 18 18 16 16 15 15 15 15 15 14 14 13 13 13
[127] 13 13 12 12 12 11 11 11 11 11 11 11 10 10
[141] 10 10 10 10 10 9 9 9 9 9 9 9 8 8
[155] 8 8 8 8 8 8 7 7 7 7 7 7 7 7
[169] 7 7 7 6 6 6 6 6 6 6 6 6 6 6
[183] 6 6 6 6 6 6 5 5 5 5 5 5 5 5
[197] 5 5 5 5 5 5 5 5 5 5 5 5 4 4
[211] 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[225] 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[239] 4 3 3 3 3 3 3 3 3 3 3 3 3 3
[253] 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[267] 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[281] 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[295] 3 2 2 2 2 2 2 2 2 2 2 2 2 2
[309] 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[323] 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[337] 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[351] 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[365] 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[379] 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[393] 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[407] 2 2 2 1 1 1 1 1 1 1 1 1 1 1
[421] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[435] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[449] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[463] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[477] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[491] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[505] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[533] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[547] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[561] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[575] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[589] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[603] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[617] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[631] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[645] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[659] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[673] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[687] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[701] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[715] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[729] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[743] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[757] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[771] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[785] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[799] 1 1 1 1 1 1 1
fb_s$hour = (fb_s$time/60) %% 24
fb_s$weekday = (fb_s$time/(60*24)) %% 7
fb_s$month = (fb_s$time/(60*24*30)) %% 12 #month-ish
fb_s$year = fb_s$time/(60*24*365)
fb_s$day = (fb_s$time/(60*24)) %% 365
head(fb_s)
row_id x y accuracy time place_id hour weekday
1 600 1.2214 2.7023 17 65380 6683426742 9.666667 3.4027778
2 957 1.1832 2.6891 58 785470 6683426742 11.166667 6.4652778
3 4345 1.1935 2.6550 11 400082 6889790653 20.033333 4.8347222
4 4735 1.1452 2.6074 49 514983 6822359752 15.050000 0.6270833
5 5580 1.0089 2.7287 19 732410 1527921905 14.833333 4.6180556
6 6090 1.1140 2.6262 11 145507 4000153867 1.116667 3.0465278
month year day
1 1.513426 0.1243912 45.40278
2 6.182176 1.4944254 180.46528
3 9.261157 0.7611910 277.83472
4 11.920903 0.9798002 357.62708
5 4.953935 1.3934741 143.61806
6 3.368218 0.2768398 101.04653
summary(fb_s)
row_id x y accuracy
Min. : 600 Min. :1.000 Min. :2.500 Min. : 1.00
1st Qu.: 7327816 1st Qu.:1.049 1st Qu.:2.574 1st Qu.: 25.00
Median :14430714 Median :1.123 Median :2.642 Median : 62.00
Mean :14505688 Mean :1.123 Mean :2.632 Mean : 82.48
3rd Qu.:21634631 3rd Qu.:1.190 3rd Qu.:2.688 3rd Qu.: 75.00
Max. :29112154 Max. :1.250 Max. :2.750 Max. :1004.00
time place_id hour weekday
Min. : 119 Length:17710 Min. : 0.000 Min. :0.000
1st Qu.:174070 Class :character 1st Qu.: 6.633 1st Qu.:1.780
Median :403388 Mode :character Median :11.950 Median :3.317
Mean :397551 Mean :12.019 Mean :3.443
3rd Qu.:602112 3rd Qu.:17.567 3rd Qu.:5.183
Max. :786218 Max. :23.983 Max. :6.999
month year day
Min. : 0.000532 Min. :0.0002264 Min. : 0.0222
1st Qu.: 2.016146 1st Qu.:0.3311829 1st Qu.: 59.8905
Median : 4.023229 Median :0.7674800 Median :120.5826
Mean : 4.698002 Mean :0.7563761 Mean :142.2165
3rd Qu.: 7.041325 3rd Qu.:1.1455703 3rd Qu.:215.5274
Max. :11.999907 Max. :1.4958486 Max. :364.8056
Split data
Since fb_s has 17710 rows, after sorted by time, split the data by 0.9 vs 0.1, so take the earlier 16000 events as training data, and the remaining to be valid data, and the 16000th data is time==713568, so we use 7.1e5 to be the filter.
nrow(fb_s)
[1] 17710
sort(fb_s$time)[16000]
[1] 713568
small_train = fb_s[fb_s$time < 7.1e5,]
small_val = fb_s[fb_s$time >= 7.1e5,]
visualize 2D: small_train
Visualize small train data by x,y, colored by place_id, since there are some overlap data, will use time or accuracy to sepreate.
Assumption: - There supposed to be unique (x,y) will have only 1 place_id, or to say, within the scope of a cluster, the (x,y)s share the same place id.
ggplot(small_train, aes(x, y )) +
geom_point(aes(color = place_id)) +
theme_minimal() +
theme(legend.position = "none") +
ggtitle("Check-ins colored by place_id")
visualize 3D: small_train with place_id count>500 z=hour
Count by place_id, will use the largest 8 group of place id to be colored.
sort((small_train %>% count(place_id))$n, decreasing = T)[0:140]
[1] 949 831 768 727 651 577 541 505 485 453 369 303 246 238 204 201 198
[18] 195 168 166 164 160 159 155 154 153 147 130 124 122 121 114 113 109
[35] 107 101 100 95 93 93 89 89 89 87 84 83 81 79 79 76 75
[52] 71 71 70 68 68 62 60 58 54 53 53 53 52 52 48 47 47
[69] 46 42 41 40 40 39 39 38 36 34 33 31 31 31 30 30 28
[86] 28 27 27 27 26 26 26 25 25 24 24 24 23 23 23 23 22
[103] 21 21 20 20 20 19 19 19 18 16 15 15 15 15 14 13 13
[120] 13 12 12 12 11 11 11 11 10 10 10 10 10 10 9 9 9
[137] 9 9 9 9
Observation: - From the 3D z=hour, place_id is determined by (x,y), will note change with time. - There are overlap maybe by mistaken place_id, such as, in the green cluster, there are pink, orange and meat colors. These data should be modified or cleaned. - Almost every cluster has dense and sparse area. This maybe another potential feature. - Some clusters have singular points, such as gray and orange cluster. - Each place_is has its own open time, some are whole-day, some late night, etc.
Assumption: - According to the problem introduction, ‘Inconsistent and erroneous location data’, for example, for the orange labeled point x located in the meat scope, the label is correct, but the location should not be there, but when the algorithm’s leanring, this point will be identified as meat area. So is given a new data, it has the same location and accuracy, it should be classified as orange rather than meat. How?
small_train %>% count(place_id) %>% filter(n > 500) -> ids
#if n>200, warning: n too large, allowed maximum for palette Set2 is 8
small_trainz = small_train[small_train$place_id %in% ids$place_id,]
plot_ly(data = small_trainz, x = x , y = y, z = hour, color = place_id, type = "scatter3d", mode = "markers", marker=list(size= 5)) %>% layout(title = "Place_id's by position and Time of Day")
z=week
plot_ly(data = small_trainz, x = x , y = y, z = weekday, color = place_id, type = "scatter3d", mode = "markers", marker=list(size= 5)) %>% layout(title = "Place_id's by position and Day of Week")
Count unique place_id
length(unique(small_train$place_id))
[1] 755
Ignore fewer place_id
small_train %>% count(place_id) %>% filter(n > 3) -> ids
small_train = small_train[small_train$place_id %in% ids$place_id,]
summary(small_train)
row_id x y accuracy
Min. : 600 Min. :1.000 Min. :2.500 Min. : 1.00
1st Qu.: 7352971 1st Qu.:1.048 1st Qu.:2.575 1st Qu.: 24.00
Median :14451675 Median :1.123 Median :2.644 Median : 62.00
Mean :14521486 Mean :1.123 Mean :2.633 Mean : 80.23
3rd Qu.:21632784 3rd Qu.:1.191 3rd Qu.:2.688 3rd Qu.: 75.00
Max. :29112154 Max. :1.250 Max. :2.750 Max. :1000.00
time place_id hour weekday
Min. : 203 Length:15180 Min. : 0.000 Min. :0.000694
1st Qu.:158300 Class :character 1st Qu.: 6.667 1st Qu.:1.771528
Median :360931 Mode :character Median :11.917 Median :3.287500
Mean :356480 Mean :12.013 Mean :3.419828
3rd Qu.:552239 3rd Qu.:17.533 3rd Qu.:5.145139
Max. :709999 Max. :23.983 Max. :6.999306
month year day
Min. : 0.000532 Min. :0.0003862 Min. : 0.0222
1st Qu.: 1.791412 1st Qu.:0.3011806 1st Qu.: 53.2045
Median : 3.614051 Median :0.6867028 Median :108.2149
Mean : 4.585452 Mean :0.6782347 Mean :139.5224
3rd Qu.: 7.408015 3rd Qu.:1.0506830 3rd Qu.:226.9903
Max. :11.999907 Max. :1.3508352 Max. :364.8056
s = 2
l = 125
w = 500
create_matrix = function(train) {
cbind(s*train$y,
train$x,
train$hour/l,
train$weekday/w,
train$year/w,
train$month/w,
train$time/(w*60*24*7))
}
X = create_matrix(small_train)
X_val = create_matrix(small_val)
KNN
model_knn = FNN::knn(train = X, test = X_val, cl = small_train$place_id, k = 15)
preds <- as.character(model_knn)
truth <- as.character(small_val$place_id)
mean(truth == preds)
[1] 0.5036415
head(X)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 5.4046 1.2214 0.077333333 0.006805556 0.0002487823 0.003026852
[2,] 5.3100 1.1935 0.160266667 0.009669444 0.0015223820 0.018522315
[3,] 5.2524 1.1140 0.008933333 0.006093056 0.0005536796 0.006736435
[4,] 5.0006 1.1449 0.135600000 0.005412500 0.0012038699 0.014647083
[5,] 5.0374 1.2015 0.128266667 0.007336111 0.0026283181 0.007977870
[6,] 5.4646 1.1916 0.042533333 0.010443056 0.0012176522 0.014814769
[,7]
[1,] 0.01297222
[2,] 0.07938135
[3,] 0.02887044
[4,] 0.06277321
[5,] 0.13704802
[6,] 0.06349187
Random Forest
set.seed(131L)
small_train$place_id <- as.factor(small_train$place_id) # ranger needs factors for classification
model_rf <- ranger(place_id ~ x + y + accuracy + hour + weekday + month + year,
small_train,
num.trees = 100,
write.forest = TRUE,
importance = "impurity")
Growing trees.. Progress: 94%. Estimated remaining time: 1 seconds.
pred = predict(model_rf, small_val)
pred = pred$predictions
accuracy = mean(pred == small_val$place_id)
accuracy
[1] 0.5518207
Visualize RF accuracy It does seem that the correctly identified check-ins are more “clustered” while the wrongly identified ones are more uniformly distributed but other than that no clear patters here.
small_val$Correct = (pred == small_val$place_id)
ggplot(small_val, aes(x, y )) +
geom_point(aes(color = Correct)) +
theme_minimal() +
scale_color_brewer(palette = "Set1")
look at what kind of id’s our random forest gets wrong We see below that our model is doing actually really great on the more popular id’s(more blue on the right). However it loses when it looks at id’s that appear only a few times.
#reordering the levels based on counts:
small_val$place_id <- factor(small_val$place_id,
levels = names(sort(table(small_val$place_id), decreasing = TRUE)))
small_val %>%
ggplot(aes(x = place_id)) + geom_bar(aes(fill = Correct)) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
ggtitle("Prediction Accuracy by ID and Popularity") +
scale_fill_brewer(palette = "Set1")
importance of our variables 1. y variable is more important than the x This means that the y axis is a better predictior of place_id and the random forest figures this out on its own. 2. hour and other time features are also good predictiors but less so than the spatial features - this makes sense since the location of a check-in should be more important than the time of the check-in. 3. Accuracy is a bit misterious since we don’t get an actual definition for it, but at least the model tells us it’s somewhat important.
data.frame(as.list(model_rf$variable.importance)) %>% gather() %>%
ggplot(aes(x = reorder(key, value), y = value)) +
geom_bar(stat = "identity", width = 0.6, fill = "grey") +
coord_flip() +
theme_minimal() +
ggtitle("Variable Importance (Gini Index)") +
theme(axis.title.y = element_blank())